Citation-based bootstrapping for large-scale author disambiguation

Authors

  • Michael Levin
  • Stefan Krawczyk
  • Steven Bethard
  • Daniel Jurafsky
Abstract

We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first “bootstrap” stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters’ Web of Knowledge with a B³ F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don’t already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F
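The B3 (B-cubed) F1 score quoted in the abstract is a standard clustering metric. The sketch below is not the authors' evaluation code; it is a minimal illustration of the usual per-item B-cubed definition, assuming each author instance carries exactly one predicted and one gold cluster label. The function name `b_cubed` and the example labels are hypothetical.

```python
from collections import defaultdict


def b_cubed(predicted, gold):
    """Return (precision, recall, F1) under the standard B-cubed definition.

    predicted, gold: dicts mapping each item id to a cluster label.
    Both dicts must cover the same, non-empty set of items (an assumption
    of this sketch).
    """
    if not gold or set(predicted) != set(gold):
        raise ValueError("predicted and gold must label the same non-empty item set")

    # Invert the label maps: cluster label -> set of member items.
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for item, label in predicted.items():
        pred_clusters[label].add(item)
    for item, label in gold.items():
        gold_clusters[label].add(item)

    precision_sum = 0.0
    recall_sum = 0.0
    for item in gold:
        pred_members = pred_clusters[predicted[item]]
        gold_members = gold_clusters[gold[item]]
        overlap = len(pred_members & gold_members)
        # Per-item precision: share of the predicted cluster that is correct.
        precision_sum += overlap / len(pred_members)
        # Per-item recall: share of the gold cluster that was recovered.
        recall_sum += overlap / len(gold_members)

    n = len(gold)
    precision = precision_sum / n
    recall = recall_sum / n
    # overlap >= 1 for every item, so precision + recall is always positive.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four author instances, two true authors.
example_pred = {"a": 0, "b": 0, "c": 0, "d": 1}
example_gold = {"a": "x", "b": "x", "c": "y", "d": "y"}
print(b_cubed(example_pred, example_gold))
```

Because precision and recall are averaged over items rather than over clusters, large clusters weigh more heavily, which suits a dataset where a few common names account for many author instances.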


Similar resources

Author name disambiguation: What difference does it make in author-based citation analysis?

In this paper, we explore how strongly author name disambiguation (AND) affects the results of an author-based citation analysis study, and identify conditions under which the commonly used simplified approach of using surnames and first initials may suffice in practice. We compare author citation ranking and co-citation mapping results in the stem cell research field 2004-2009 between two AND ...


Translation Disambiguation Using Bilingual Bootstrapping

This article proposes a new method for word translation disambiguation, one that uses a machine-learning technique called bilingual bootstrapping. In learning to disambiguate words to be translated, bilingual bootstrapping makes use of a small amount of classified data and a large amount of unclassified data in both the source and the target languages. It repeatedly constructs classifiers in the...


Word Translation Disambiguation Using Bilingual Bootstrapping

This article proposes a new method for word translation disambiguation, one that uses a machine-learning technique called bilingual bootstrapping. In learning to disambiguate words to be translated, bilingual bootstrapping makes use of a small amount of classified data and a large amount of unclassified data in both the source and the target languages. It repeatedly constructs classifiers in the...


Motif-based success scores in coauthorship networks are highly sensitive to author name disambiguation.

Following the work of Krumov et al. [Eur. Phys. J. B 84, 535 (2011)] we revisit the question whether the usage of large citation datasets allows for the quantitative assessment of social (by means of coauthorship of publications) influence on the progression of science. Applying a more comprehensive and well-curated dataset containing the publications in the journals of the American Physical So...



Journal title:
  • JASIST

Volume 63


Publication date: 2012